Conversation
… full-run mode In full workflow (madengine run --tags), the build phase ran with build_only_mode=True and skipped GPU detection, leaving Dockerfiles that declare ARG MAD_SYSTEM_GPU_ARCHITECTURE without a default built with an empty value. Users had to manually pass --additional-context every time. - Context.__init__: add detect_local_gpu_arch param (default False), thread it to init_build_context() - Context.init_build_context: add detect_gpu_arch param; when True, reuse detect_gpu_vendor() + get_gpu_tool_manager() + normalize_architecture_name() to detect and inject MAD_SYSTEM_GPU_ARCHITECTURE into docker_build_arg before the build; user-provided value is never overridden; fails gracefully - BuildOrchestrator.__init__: accept and forward detect_local_gpu_arch to Context; add resolved-arch confirmation print in execute() - RunOrchestrator._build_phase: pass detect_local_gpu_arch=True so only the full-workflow build path auto-detects; standalone madengine build is unaffected (flag defaults to False) Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR enables auto-detection of MAD_SYSTEM_GPU_ARCHITECTURE during the build phase of the full madengine run --tags ... workflow so Dockerfiles that require this build-arg (without a default) can build successfully without forcing users to pass it via --additional-context.
Changes:
- Enable GPU-arch auto-detection in
RunOrchestrator._build_phase()by passing a newdetect_local_gpu_arch=Trueflag intoBuildOrchestrator. - Add
detect_local_gpu_archplumbing throughBuildOrchestratorintoContextand implement optional build-only GPU arch injection inContext.init_build_context(). - Print a “resolved” message when
MAD_SYSTEM_GPU_ARCHITECTUREis present indocker_build_argbefore building.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
src/madengine/orchestration/run_orchestrator.py |
Turns on build-time GPU arch auto-detection for full build+run workflows by passing detect_local_gpu_arch=True. |
src/madengine/orchestration/build_orchestrator.py |
Adds detect_local_gpu_arch parameter, forwards it into Context, and prints when the arch is resolved. |
src/madengine/core/context.py |
Implements optional build-only detection of GPU arch and injects it into ctx["docker_build_arg"]["MAD_SYSTEM_GPU_ARCHITECTURE"]. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…ild_context Five new tests in TestBuildContextGpuArchAutoDetect cover the detect_gpu_arch path added in 78b5110: - arch is injected when MAD_SYSTEM_GPU_ARCHITECTURE is absent - user-provided value is preserved (no override) - UNKNOWN vendor emits a warning and leaves the key unset - detection exceptions are caught and warned, not raised - detect_gpu_arch=False skips vendor detection entirely Also introduces _make_build_only_ctx() helper to safely suppress __init__'s init_build_context call via a scoped patch.object context manager, avoiding the previous broken pattern of calling .stop() on a MagicMock (which is a no-op and left the real method never invoked). Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
|
||
| import pytest | ||
| from unittest.mock import Mock, patch | ||
| from unittest.mock import Mock, MagicMock, patch |
There was a problem hiding this comment.
Mock is imported but never used in this test module. This will trigger flake8 F401 (the repo has a flake8 pre-commit hook in .pre-commit-config.yaml). Remove the unused import or use it.
| from unittest.mock import Mock, MagicMock, patch | |
| from unittest.mock import MagicMock, patch |
| resolved_arch = self.context.ctx.get("docker_build_arg", {}).get("MAD_SYSTEM_GPU_ARCHITECTURE") | ||
| if resolved_arch: | ||
| self.rich_console.print( | ||
| f"[green]✓ MAD_SYSTEM_GPU_ARCHITECTURE resolved: {resolved_arch}[/green]\n" | ||
| ) |
There was a problem hiding this comment.
The build phase banner says “(Build-only mode - no GPU detection)”, but when detect_local_gpu_arch is enabled the build-only Context now performs GPU architecture detection (and these new lines print a resolved MAD arch). Consider updating the banner/message to reflect conditional detection so the output isn’t misleading.
| build_only_mode: Whether running in build-only mode (no GPU detection). | ||
| rocm_path: Optional ROCm installation path (overrides ROCM_PATH env; default /opt/rocm). | ||
| detect_local_gpu_arch: When True and in build_only_mode, attempt to auto-detect | ||
| MAD_SYSTEM_GPU_ARCHITECTURE from the local node and inject it into docker_build_arg. | ||
| Has no effect when build_only_mode=False (runtime mode detects it via init_gpu_context). |
There was a problem hiding this comment.
The build_only_mode arg is described as “no GPU detection”, but build-only mode can now still do GPU architecture detection when detect_local_gpu_arch=True. Consider updating this docstring wording so it matches the new behavior (e.g., “skips runtime GPU context; may optionally detect build arch”).
Summary
madengine run --tags): the build phase previously ran withbuild_only_mode=Trueand skipped GPU detection entirely, soDockerfiles with
ARG MAD_SYSTEM_GPU_ARCHITECTURE(no default) were built with an empty value. Users were forced to manually supply--additional-context '{"docker_build_arg": {"MAD_SYSTEM_GPU_ARCHITECTURE": "gfx942"}}'on every local run.GPU arch auto-detection — design
Detection fires only in the
RunOrchestrator._build_phase()path (full workflow). The flag isFalseby default so standalonemadengine buildis unaffected.Detection chain reuses existing modules — no new code:
detect_gpu_vendor()(utils/gpu_validator.py) — fast, filesystem-only check, no subprocessget_gpu_tool_manager(vendor, rocm_path)(utils/gpu_tool_factory.py) — singleton factorymanager.get_gpu_architecture()(utils/rocm_tool_manager.py/nvidia_tool_manager.py) — cached, with actionable error messagesnormalize_architecture_name()(execution/dockerfile_utils.py) — consistent formatUser-provided value in
--additional-contextis always respected and never overridden. Fails gracefully with a warning on nodes without a GPU.Test plan
madengine run --tags <tag>on a local GPU node — build phase printsAuto-detected GPU architecture for build: gfx942, no "unresolved" warning, image builds correctlymadengine run --tags <tag> --additional-context '{"docker_build_arg": {"MAD_SYSTEM_GPU_ARCHITECTURE": "gfx90a"}}'— user value respected, auto-detect skippedmadengine build --tags <tag>on a non-GPU CI node — same behaviour as before (warning if Dockerfile needs arch, no crash)madengine run --manifest-file build_manifest.json— run-only path unchangedpytest tests/unit/ tests/e2e/